Overview

Principle Component Analysis is widely used in data exploration, dimension reduction, data visualization. The aim is to transform original data into uncorrelated linear combinations of the original data while keeping the information contained in the data. High dimensional data tends to show clusters in lower dimensional view.

Clustering Analysis is another form of EDA. Here we are hoping to group data points which are close to each other within the groups and far away between different groups. Clustering using PC’s can be effective. Clustering analysis can be very subjective in the way we need to summarize the properties within each group.

Both PCA and Clustering Analysis are so called unsupervised learning. There is no response variables involved in the process.

For supervised learning, we try to find out how does a set of predictors relate to some response variable of the interest. Multiple regression is still by far, one of the most popular methods. We use a linear models as a working model for its simplicity and interpretability. It is important that we use domain knowledge as much as we can to determine the form of the response as well as the function format of the factors on the other hand.

0.1 Objectives

  • PCA
  • SVD
  • Clustering Analysis
  • Linear Regression

0.2 Review materials

  • Study Module 2: PCA
  • Study Module 3: Clustering Analysis
  • Study Module 4: Multiple regression

0.3 Data needed

  • NLSY79.csv
  • brca_subtype.csv
  • brca_x_patient.csv

1 Case study 1: Self-seteem

Self-esteem generally describes a person’s overall sense of self-worthiness and personal value. It can play significant role in one’s motivation and success throughout the life. Factors that influence self-esteem can be inner thinking, health condition, age, life experiences etc. We will try to identify possible factors in our data that are related to the level of self-esteem.

In the well-cited National Longitudinal Study of Youth (NLSY79), it follows about 13,000 individuals and numerous individual-year information has been gathered through surveys. The survey data is open to public here. Among many variables we assembled a subset of variables including personal demographic variables in different years, household environment in 79, ASVAB test Scores in 81 and Self-Esteem scores in 81 and 87 respectively.

The data is store in NLSY79.csv.

Here are the description of variables:

Personal Demographic Variables

  • Gender: a factor with levels “female” and “male”
  • Education05: years of education completed by 2005
  • HeightFeet05, HeightInch05: height measurement. For example, a person of 5’10 will be recorded as HeightFeet05=5, HeightInch05=10.
  • Weight05: weight in lbs.
  • Income87, Income05: total annual income from wages and salary in 2005.
  • Job87 (missing), Job05: job type in 1987 and 2005, including Protective Service Occupations, Food Preparation and Serving Related Occupations, Cleaning and Building Service Occupations, Entertainment Attendants and Related Workers, Funeral Related Occupations, Personal Care and Service Workers, Sales and Related Workers, Office and Administrative Support Workers, Farming, Fishing and Forestry Occupations, Construction Trade and Extraction Workers, Installation, Maintenance and Repairs Workers, Production and Operating Workers, Food Preparation Occupations, Setters, Operators and Tenders, Transportation and Material Moving Workers

Household Environment

  • Imagazine: a variable taking on the value 1 if anyone in the respondent’s household regularly read magazines in 1979, otherwise 0
  • Inewspaper: a variable taking on the value 1 if anyone in the respondent’s household regularly read newspapers in 1979, otherwise 0
  • Ilibrary: a variable taking on the value 1 if anyone in the respondent’s household had a library card in 1979, otherwise 0
  • MotherEd: mother’s years of education
  • FatherEd: father’s years of education
  • FamilyIncome78

Variables Related to ASVAB test Scores in 1981

Test Description
AFQT percentile score on the AFQT intelligence test in 1981
Coding score on the Coding Speed test in 1981
Auto score on the Automotive and Shop test in 1981
Mechanic score on the Mechanic test in 1981
Elec score on the Electronics Information test in 1981
Science score on the General Science test in 1981
Math score on the Math test in 1981
Arith score on the Arithmetic Reasoning test in 1981
Word score on the Word Knowledge Test in 1981
Parag score on the Paragraph Comprehension test in 1981
Numer score on the Numerical Operations test in 1981

Self-Esteem test 81 and 87

We have two sets of self-esteem test, one in 1981 and the other in 1987. Each set has same 10 questions. They are labeled as Esteem81 and Esteem87 respectively followed by the question number. For example, Esteem81_1 is Esteem question 1 in 81.

The following 10 questions are answered as 1: strongly agree, 2: agree, 3: disagree, 4: strongly disagree

  • Esteem 1: “I am a person of worth”
  • Esteem 2: “I have a number of good qualities”
  • Esteem 3: “I am inclined to feel like a failure”
  • Esteem 4: “I do things as well as others”
  • Esteem 5: “I do not have much to be proud of”
  • Esteem 6: “I take a positive attitude towards myself and others”
  • Esteem 7: “I am satisfied with myself”
  • Esteem 8: “I wish I could have more respect for myself”
  • Esteem 9: “I feel useless at times”
  • Esteem 10: “I think I am no good at all”

1.1 Data preparation

Load the data. Do a quick EDA to get familiar with the data set. Pay attention to the unit of each variable. Are there any missing values?

Observations:

  • Some of the income values are negative – this doesn’t seem right.

  • Many Jobs5 are missing values.

  • Some of the HeightFeet05 are negative which doesn’t make sense.

  • A number of test score values are 0, which I would assume means these people never took these exams.

  • All of the esteem columns seem to be missing no entries.

1.2 Self esteem evaluation

Let concentrate on Esteem scores evaluated in 87.

  1. First do a quick summary over all the Esteem variables. Pay attention to missing values, any peculiar numbers etc. How do you fix problems discovered if there is any? Briefly describe what you have done for the data preparation.

Observations:

  • There are no missing values.

  • The average of the possible respondent responses (1-4) is 2.5. However, all of the questions either skew agree (i.e., a 1.5 mean) or skew disagree (i.e., a 3.5 mean).

  • One possible issue is that when a respondent responses agree, their response could indicate either low or high self esteem based on the question asked. For example, a person with strong self esteem would response with a low number (1 or 2) to a question like Question 1: “I am a person of worth” but with a high number (3 or 4) to a question like Question 5: “I do not have much to be proud of.” We will solve this issue in the following step.

temp <- read.csv('data/NLSY79.csv', header = T, stringsAsFactors = F)

summary(temp$Esteem87_1)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    1.00    1.00    1.38    2.00    4.00
summary(temp$Esteem87_2)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     1.0     1.0     1.0     1.4     2.0     4.0
summary(temp$Esteem87_3)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    3.00    4.00    3.58    4.00    4.00
summary(temp$Esteem87_4)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     1.0     1.0     1.0     1.5     2.0     4.0
summary(temp$Esteem87_5)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    3.00    4.00    3.53    4.00    4.00
summary(temp$Esteem87_6)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    1.00    2.00    1.59    2.00    4.00
summary(temp$Esteem87_7)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    1.00    2.00    1.72    2.00    4.00
summary(temp$Esteem87_8)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     1.0     3.0     3.0     3.1     4.0     4.0
summary(temp$Esteem87_9)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    3.00    3.00    3.06    4.00    4.00
summary(temp$Esteem87_10)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    3.00    3.00    3.37    4.00    4.00
  1. Reverse Esteem 1, 2, 4, 6, and 7 so that a higher score corresponds to higher self-esteem. (Hint: if we store the esteem data in data.esteem, then data.esteem[, c(1, 2, 4, 6, 7)] <- 5 - data.esteem[, c(1, 2, 4, 6, 7)] to reverse the score.)
data.esteem <- data.frame(temp$Esteem87_1,temp$Esteem87_2,temp$Esteem87_3,temp$Esteem87_4,temp$Esteem87_5,temp$Esteem87_6,temp$Esteem87_7,temp$Esteem87_8,temp$Esteem87_9,temp$Esteem87_10)

colnames(data.esteem)[1] = "q1"
colnames(data.esteem)[2] = "q2"
colnames(data.esteem)[3] = "q3"
colnames(data.esteem)[4] = "q4"
colnames(data.esteem)[5] = "q5"
colnames(data.esteem)[6] = "q6"
colnames(data.esteem)[7] = "q7"
colnames(data.esteem)[8] = "q8"
colnames(data.esteem)[9] = "q9"
colnames(data.esteem)[10] = "q10"

data.esteem[, c(1,2,4,6,7)] <- 5 - data.esteem[, c(1,2,4,6,7)]
  1. Write a brief summary with necessary plots about the 10 esteem measurements.
  • Plotting the respondents responses using a stacked bar graph we see that most respondents have higher self esteem and that very few respondents were putting 1s (less than 500) compared to 4s (greater than 10,000).

  • Question 2 has the highest mean (3.6), while Question 9 has the lowest mean (3.06).

## Warning in melt(data.esteem): The melt generic in data.table has been
## passed a data.frame and will attempt to redirect to the relevant reshape2
## method; please note that reshape2 is deprecated, and this redirection is now
## deprecated as well. To continue using melt methods from reshape2 while both
## libraries are attached, e.g. melt.list, you can prepend the namespace like
## reshape2::melt(data.esteem). In the next version, this warning will become an
## error.
## No id variables; using all as measure variables

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    3.00    4.00    3.62    4.00    4.00
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     1.0     3.0     4.0     3.6     4.0     4.0
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    3.00    4.00    3.58    4.00    4.00
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     1.0     3.0     4.0     3.5     4.0     4.0
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    3.00    4.00    3.53    4.00    4.00
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    3.00    3.00    3.41    4.00    4.00
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    3.00    3.00    3.28    4.00    4.00
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     1.0     3.0     3.0     3.1     4.0     4.0
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    3.00    3.00    3.06    4.00    4.00
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    3.00    3.00    3.37    4.00    4.00
  1. Do esteem scores all positively correlated? Report the pairwise correlation table and write a brief summary.
  • There is a positive coorelation between all esteem scores. This reachnes from 0.24 (q1 and q9) to 0.7 (q1 and q2).

data.esteem
res <- cor(data.esteem)
round(res, 2)
  1. PCA on 10 esteem measurements. (centered but no scaling)
## [1] "sdev"     "rotation" "center"   "scale"    "x"
a) Report the PC1 and PC2 loadings. Are they unit vectors? Are they orthogonal? 
  • The two loadings are orthoginal with unit 1.

PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8 PC9 PC10
q1 0.324 -0.445 0.107 -0.216 0.237 -0.323 0.012 -0.081 -0.011 0.687
q2 0.333 -0.428 0.055 -0.234 0.199 -0.277 0.063 -0.039 0.080 -0.721
q3 0.322 0.011 0.513 0.245 -0.172 0.007 -0.621 0.395 -0.033 -0.028
q4 0.324 -0.288 -0.109 -0.211 -0.093 0.827 0.126 0.174 0.120 0.052
q5 0.315 0.079 0.451 0.471 -0.125 0.019 0.645 -0.184 -0.059 -0.002
q6 0.347 -0.049 -0.419 0.181 -0.201 0.014 -0.195 -0.306 -0.703 -0.042
q7 0.315 0.020 -0.550 0.335 -0.275 -0.278 0.064 0.279 0.498 0.059
q8 0.280 0.362 -0.140 0.236 0.824 0.157 -0.096 0.004 0.050 -0.009
q9 0.277 0.492 -0.002 -0.516 -0.081 -0.177 0.278 0.464 -0.290 0.015
q10 0.318 0.392 0.102 -0.323 -0.222 0.026 -0.225 -0.620 0.381 0.012
##    q1    q2    q3    q4    q5    q6    q7    q8    q9   q10 
## 0.324 0.333 0.322 0.324 0.315 0.347 0.315 0.280 0.277 0.318
##      q1      q2      q3      q4      q5      q6      q7      q8      q9     q10 
## -0.4452 -0.4283  0.0115 -0.2877  0.0793 -0.0492  0.0196  0.3619  0.4917  0.3918
b) Are there good interpretations for PC1 and PC2? (If loadings are all negative, take the positive loadings for the ease of interpretation)
  • PC1 interpretation: all of the coefficients are positive and around 0.3. This means there is a positive correlation between almost oll of the variables. So a growing positive alue in PC1 means a rather uniform gorwth of values in al of the varirables and this means that as much a person has a higher value in PC1, they are likeley to have a higher response/answer (1-4) to almost all of the 10 questions. This says that the higher a person’s response (1-4) to the 10 questions, the higher their self-esteem, which makes sense. Since all of the coefficients are around 0.3, PC1 is proportional to the total of the 10 answers/responses.

  • PC 2 interpretation:

    1. How is the PC1 score obtained for each subject? Write down the formula.
  • The PC1 is the linear combination of the 10 scores which minimizes the total squared perpendicular distance.

  • for each subject, their PC1 score is 0.324 x (q1 response) + 0.333 x (q2 response) + 0.322 x (q3 response) + 0.324 x (q4 response) + 0.315 x (q5 response) + 0.347 x (q6 response) + 0.315 x (q7 response) + 0.280 x (q8 response) + 0.277 x (q9 response) + 0.318 x (q10 response).

    1. Are PC1 scores and PC2 scores in the data uncorrelated?
  • Yes the PC1 scores and PC2 scores are uncorrelated. PC1 scores has greater variance than PC2 scores.

    1. Plot PVE (Proportion of Variance Explained) and summarize the plot.
  • From a scree plot of the variances of each pc we see that a large portion of the total variance is explained by the leading principle component (0.469 of the total variance). The variances of each PC are decreasing (i.e., 0.469 from PC1 while 0.0294 from PC10).

summary(pc.10)$importance
##                          PC1   PC2   PC3    PC4    PC5    PC6    PC7    PC8
## Standard deviation     2.166 1.102 0.911 0.8090 0.7699 0.7037 0.6714 0.6346
## Proportion of Variance 0.469 0.121 0.083 0.0654 0.0593 0.0495 0.0451 0.0403
## Cumulative Proportion  0.469 0.590 0.673 0.7388 0.7981 0.8477 0.8927 0.9330
##                           PC9   PC10
## Standard deviation     0.6133 0.5421
## Proportion of Variance 0.0376 0.0294
## Cumulative Proportion  0.9706 1.0000
plot(pc.10) # variances of each pc

plot(summary(pc.10)$importance[2,], # scree plot of PVEs
     ylab="PVE",
     xlab="Number of PCs",
     pch = 16,
     main="Scree plot of PVE")

f) Also plot CPVE (Cumulative Proportion of Variance Explained). What proportion of the variance in the data is explained by the first two principal components?
  • Plotting the CPVE, we see that 0.59 of the total variance in the data is explained by the first two principal components.

plot(summary(pc.10)$importance[3,],
     ylab="Cumulative PVE",
     xlab="Number of PCs",
     pch = 16,
     main="Scree plot of Cumulative PVE")

g) PC’s provide us with a low dimensional view of the self-esteem scores. Use a biplot with the first two PC's to display the data.  Give an interpretation of PC1 and PC2 from the plot. (try `ggbiplot` if you could, much prettier!)
  • A biplot with PC1 and PC2 indicated that a) PC1 loadings are similar in magnitudes and with same signs, b) PC2 captures difference between total of question 8, 9, and 10 and total of question 1, 2, and 4. Questions 3, 5, 6, 7 have little affect, c) questions 8, 9, and 10 are highly correlated and so are questions 1, 2, and 4.

limx <- c(-0,0.03)
limy <- c(-0.05,0.05)
biplot(pc.10,
       xlim=limx,
       ylim=limy,
       main="Biplot of PC1 and PC2")

  1. Apply k-means to cluster subjects on the original esteem scores

    1. Find a reasonable number of clusters using within sum of squared with elbow rules.
  • Based on the “Elbow method” and the result of our factoextra::fviz_nbclust call, we will use k=2 to have two clusters.

set.seed(0)
factoextra::fviz_nbclust(data.esteem[,-1], kmeans, method = "wss")

b) Can you summarize common features within each cluster?

* Participants within each cluster will have had similar responses to the 10 questions as others in their cluster.

c) Can you visualize the clusters with somewhat clear boundaries? You may try different pairs of variables and different PC pairs of the esteem scores.
 as.data.frame(pc.10$x) %>%
  ggplot(aes(x=PC1, y=PC10)) +
  geom_point()+
  geom_vline(xintercept = 0) +
  geom_hline(yintercept = 0) +
  ggtitle("PC2 vs. PC1") +
  theme_bw()

pairs(pc.10$x, xlim=c(-4, 4), ylim=c(-4, 4), col=rainbow(6), pch=16)

  1. We now try to find out what factors are related to self-esteem? PC1 of all the Esteem scores is a good variable to summarize one’s esteem scores. We take PC1 as our response variable.

    1. Prepare possible factors/variables:
    • EDA the data set first.

    • Personal information: gender, education (05), log(income) in 87, job type in 87. Weight05 (lb) and HeightFeet05 together with Heightinch05. One way to summarize one’s weight and height is via Body Mass Index which is defined as the body mass divided by the square of the body height, and is universally expressed in units of kg/m². Note, you need to create BMI first. Then may include it as one possible predictor.

    • Household environment: Imagazine, Inewspaper, Ilibrary, MotherEd, FatherEd, FamilyIncome78. Do set indicators Imagazine, Inewspaper and Ilibrary as factors.

    • You may use PC1 of ASVAB as level of intelligence

# eda
esteemFactors <- data.frame(temp$Gender, temp$Education05, temp$Income87, temp$Job05, temp$Weight05, temp$HeightFeet05, temp$HeightInch05, factor(temp$Imagazine), factor(temp$Inewspaper), factor(temp$Ilibrary), temp$MotherEd, temp$FatherEd, temp$FamilyIncome78)
esteemFactors$totalHeightMeters5 <- ((esteemFactors$temp.HeightFeet05 * 12)+ esteemFactors$temp.HeightInch05) *  0.0254
esteemFactors$BMI05 <- (esteemFactors$temp.Weight05 * 0.453592) / (esteemFactors$totalHeightMeters5^2)
pc.10.loading[,1] # 0.324 0.333 0.322 0.324 0.315 0.347 0.315 0.280 0.277 0.318 
##    q1    q2    q3    q4    q5    q6    q7    q8    q9   q10 
## 0.324 0.333 0.322 0.324 0.315 0.347 0.315 0.280 0.277 0.318
esteemFactors$PC1 <- data.frame(data.esteem$q1 * 0.324 + data.esteem$q2 * 0.333 + data.esteem$q3 * 0.322 + data.esteem$q4 * 0.324 + data.esteem$q5 * 0.315 + data.esteem$q6 * 0.347 + data.esteem$q7 * 0.315 + data.esteem$q8 * 0.280 + data.esteem$q9 * 0.277 + data.esteem$q10 * 0.318)
colnames(esteemFactors)[1] = "gender"
colnames(esteemFactors)[2] = "Education05"
colnames(esteemFactors)[3] = "Income87"
colnames(esteemFactors)[4] = "Job05"
colnames(esteemFactors)[5] = "Weight05"
colnames(esteemFactors)[6] = "HeightFeet05"
colnames(esteemFactors)[7] = "HeightInch05"
colnames(esteemFactors)[8] = "Imagazine"
colnames(esteemFactors)[9] = "Inewspaper"
colnames(esteemFactors)[10] = "Ilibrary"
colnames(esteemFactors)[11] = "MotherEd"
colnames(esteemFactors)[12] = "FatherEd"
colnames(esteemFactors)[13] = "FamilyIncome78"
colnames(esteemFactors)[16] = "PC1"

data.ASVAB <- data.frame(temp$Science,temp$Arith,temp$Word,temp$Parag,temp$Number,temp$Coding,temp$Auto,temp$Math,temp$Mechanic,temp$Elec,temp$AFQT)

pc.ASVAB <- prcomp(data.ASVAB, scale=TRUE)
pc.ASVAB.loading <- pc.ASVAB$rotation
knitr::kable(pc.ASVAB.loading)
PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8 PC9 PC10 PC11
temp.Science 0.328 -0.159 0.089 -0.236 0.011 -0.464 -0.476 -0.205 -0.496 -0.270 -0.007
temp.Arith 0.336 0.037 0.162 0.413 0.030 0.019 0.052 0.240 0.262 -0.702 -0.258
temp.Word 0.330 0.042 0.153 -0.471 -0.001 -0.151 -0.198 -0.070 0.705 0.209 -0.191
temp.Parag 0.310 0.174 0.225 -0.465 -0.050 0.589 0.224 0.097 -0.368 -0.048 -0.243
temp.Number 0.255 0.430 -0.377 0.144 -0.731 0.006 -0.051 -0.219 0.000 0.020 0.033
temp.Coding 0.217 0.491 -0.561 -0.081 0.603 -0.099 0.038 0.112 -0.043 -0.017 0.048
temp.Auto 0.247 -0.487 -0.413 -0.015 -0.154 0.158 -0.241 0.629 -0.014 0.162 0.043
temp.Math 0.322 0.133 0.299 0.468 0.090 -0.189 0.014 0.124 -0.203 0.596 -0.339
temp.Mechanic 0.291 -0.349 -0.188 0.265 0.247 0.456 -0.138 -0.623 0.077 0.067 -0.011
temp.Elec 0.297 -0.355 -0.154 -0.112 -0.065 -0.359 0.774 -0.127 -0.022 0.001 0.036
temp.AFQT 0.353 0.114 0.341 0.079 0.040 0.080 0.014 0.100 0.053 0.049 0.847
pc.ASVAB.loading[,1] # 0.328 0.336 0.330 0.310 0.255 0.217 0.247 0.322 0.291 0.297 0.353 
##  temp.Science    temp.Arith     temp.Word    temp.Parag   temp.Number 
##         0.328         0.336         0.330         0.310         0.255 
##   temp.Coding     temp.Auto     temp.Math temp.Mechanic     temp.Elec 
##         0.217         0.247         0.322         0.291         0.297 
##     temp.AFQT 
##         0.353
esteemFactors$PC1ASVAB <- data.frame(temp$Science*0.328+temp$Arith*0.336+temp$Word*0.330+temp$Parag*0.310+temp$Number*0.255+temp$Coding*0.217+temp$Auto*0.247+temp$Math*0.322+temp$Mechanic*0.291+temp$Elec*0.297+temp$AFQT*0.353)
b)   Run a few regression models between PC1 of all the esteem scores and suitable variables listed in a). Find a final best model with your own criterion. 

  - How did you land this model? Run a model diagnosis to see if the linear model assumptions are reasonably met. 
    
  - Write a summary of your findings. In particular, explain what and how the variables in the model affect one's self-esteem. 

We chose the model with all variables. Running a model diagnosis, we see that both the Residuals vs Fitted and qqnormal plots look fine.

fit1 <- lm(unlist(PC1) ~ unlist(PC1ASVAB), data=esteemFactors)
fit2 <- lm(unlist(PC1) ~ Education05, data=esteemFactors)
fit3 <- lm(unlist(PC1) ~ FamilyIncome78, data=esteemFactors)
fit.all <- lm(unlist(PC1) ~ unlist(PC1ASVAB) + gender + Education05 + Income87 + Job05 + Weight05 + HeightFeet05 + HeightInch05 + Imagazine + Inewspaper + Ilibrary + MotherEd + FatherEd + FamilyIncome78 + totalHeightMeters5 + BMI05 + unlist(PC1ASVAB), data=esteemFactors)

summary(fit1)
## 
## Call:
## lm(formula = unlist(PC1) ~ unlist(PC1ASVAB), data = esteemFactors)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -4.881 -0.966  0.007  1.046  2.861 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       9.52753    0.08392   113.5   <2e-16 ***
## unlist(PC1ASVAB)  0.01590    0.00103    15.5   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.21 on 2429 degrees of freedom
## Multiple R-squared:  0.0899, Adjusted R-squared:  0.0896 
## F-statistic:  240 on 1 and 2429 DF,  p-value: <2e-16
summary(fit2)
## 
## Call:
## lm(formula = unlist(PC1) ~ Education05, data = esteemFactors)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -5.481 -1.052 -0.022  1.017  2.502 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  8.92039    0.14069    63.4   <2e-16 ***
## Education05  0.13301    0.00995    13.4   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.23 on 2429 degrees of freedom
## Multiple R-squared:  0.0685, Adjusted R-squared:  0.0681 
## F-statistic:  179 on 1 and 2429 DF,  p-value: <2e-16
summary(fit.all)
## 
## Call:
## lm(formula = unlist(PC1) ~ unlist(PC1ASVAB) + gender + Education05 + 
##     Income87 + Job05 + Weight05 + HeightFeet05 + HeightInch05 + 
##     Imagazine + Inewspaper + Ilibrary + MotherEd + FatherEd + 
##     FamilyIncome78 + totalHeightMeters5 + BMI05 + unlist(PC1ASVAB), 
##     data = esteemFactors)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -4.949 -0.909  0.000  0.972  3.025 
## 
## Coefficients: (1 not defined because of singularities)
##                                                                                           Estimate
## (Intercept)                                                                               8.84e+00
## unlist(PC1ASVAB)                                                                          7.58e-03
## gendermale                                                                                1.20e-01
## Education05                                                                               5.01e-02
## Income87                                                                                  9.94e-06
## Job0510 TO 430: Executive, Administrative and Managerial Occupations                      3.58e-01
## Job051000 TO 1240: Mathematical and Computer Scientists                                   4.06e-01
## Job051300 TO 1560: Engineers, Architects, Surveyers, Engineering and Related Technicians -7.48e-02
## Job051600 TO 1760: Physical Scientists                                                   -9.26e-01
## Job051800 TO 1860: Social Scientists and Related Workers                                 -3.05e-01
## Job051900 TO 1960: Life, Physical and Social Science Technicians                          1.48e-01
## Job052000 TO 2060: Counselors, Sociala and Religious Workers                              1.47e-01
## Job052100 TO 2150: Lawyers, Judges and Legal Support Workers                              2.09e-01
## Job052200 TO 2340: Teachers                                                               2.12e-01
## Job052400 TO 2550: Education, Training and Library Workers                                1.98e-01
## Job052600 TO 2760: Entertainers and Performers, Sports and Related Workers                6.99e-01
## Job052800 TO 2960: Media and Communications Workers                                       2.44e-01
## Job053000 TO 3260: Health Diagnosing and Treating Practitioners                           4.02e-01
## Job053300 TO 3650: Health Care Technical and Support Occupations                         -1.22e-01
## Job053700 TO 3950: Protective Service Occupations                                         6.17e-01
## Job054000 TO 4160: Food Preparation and Serving Related Occupations                      -1.37e-01
## Job054200 TO 4250: Cleaning and Building Service Occupations                             -1.48e-01
## Job054300 TO 4430: Entertainment Attendants and Related Workers                          -6.98e-01
## Job054500 TO 4650: Personal Care and Service Workers                                      2.55e-01
## Job054700 TO 4960: Sales and Related Workers                                              2.41e-01
## Job05500 TO 950: Management Related Occupations                                           4.82e-01
## Job055000 TO 5930: Office and Administrative Support Workers                              2.71e-01
## Job056000 TO 6130: Farming, Fishing and Forestry Occupations                             -7.88e-02
## Job056200 TO 6940: Construction Trade and Extraction Workers                              5.38e-03
## Job057000 TO 7620: Installation, Maintenance and Repairs Workers                          1.11e-01
## Job057700 TO 7750: Production and Operating Workers                                       1.42e-01
## Job057800 TO 7850: Food Preparation Occupations                                           1.31e-01
## Job057900 TO 8960: Setters, Operators and Tenders                                         1.38e-01
## Job059000 TO 9750: Transportation and Material Moving Workers                            -1.08e-01
## Job059990: Uncodeable                                                                    -8.03e-02
## Weight05                                                                                 -1.34e-04
## HeightFeet05                                                                              5.68e-03
## HeightInch05                                                                             -2.11e-03
## Imagazine1                                                                               -1.02e-02
## Inewspaper1                                                                               1.60e-01
## Ilibrary1                                                                                 8.20e-02
## MotherEd                                                                                  1.24e-02
## FatherEd                                                                                 -1.71e-04
## FamilyIncome78                                                                            8.07e-07
## totalHeightMeters5                                                                              NA
## BMI05                                                                                    -3.50e-03
##                                                                                          Std. Error
## (Intercept)                                                                                5.56e-01
## unlist(PC1ASVAB)                                                                           1.40e-03
## gendermale                                                                                 7.00e-02
## Education05                                                                                1.37e-02
## Income87                                                                                   2.27e-06
## Job0510 TO 430: Executive, Administrative and Managerial Occupations                       1.72e-01
## Job051000 TO 1240: Mathematical and Computer Scientists                                    2.21e-01
## Job051300 TO 1560: Engineers, Architects, Surveyers, Engineering and Related Technicians   2.30e-01
## Job051600 TO 1760: Physical Scientists                                                     6.18e-01
## Job051800 TO 1860: Social Scientists and Related Workers                                   5.12e-01
## Job051900 TO 1960: Life, Physical and Social Science Technicians                           4.78e-01
## Job052000 TO 2060: Counselors, Sociala and Religious Workers                               2.47e-01
## Job052100 TO 2150: Lawyers, Judges and Legal Support Workers                               3.49e-01
## Job052200 TO 2340: Teachers                                                                1.98e-01
## Job052400 TO 2550: Education, Training and Library Workers                                 2.73e-01
## Job052600 TO 2760: Entertainers and Performers, Sports and Related Workers                 2.92e-01
## Job052800 TO 2960: Media and Communications Workers                                        3.67e-01
## Job053000 TO 3260: Health Diagnosing and Treating Practitioners                            2.14e-01
## Job053300 TO 3650: Health Care Technical and Support Occupations                           2.00e-01
## Job053700 TO 3950: Protective Service Occupations                                          2.28e-01
## Job054000 TO 4160: Food Preparation and Serving Related Occupations                        2.16e-01
## Job054200 TO 4250: Cleaning and Building Service Occupations                               2.17e-01
## Job054300 TO 4430: Entertainment Attendants and Related Workers                            4.10e-01
## Job054500 TO 4650: Personal Care and Service Workers                                       2.43e-01
## Job054700 TO 4960: Sales and Related Workers                                               1.80e-01
## Job05500 TO 950: Management Related Occupations                                            1.97e-01
## Job055000 TO 5930: Office and Administrative Support Workers                               1.72e-01
## Job056000 TO 6130: Farming, Fishing and Forestry Occupations                               4.31e-01
## Job056200 TO 6940: Construction Trade and Extraction Workers                               1.93e-01
## Job057000 TO 7620: Installation, Maintenance and Repairs Workers                           1.99e-01
## Job057700 TO 7750: Production and Operating Workers                                        2.34e-01
## Job057800 TO 7850: Food Preparation Occupations                                            6.16e-01
## Job057900 TO 8960: Setters, Operators and Tenders                                          1.97e-01
## Job059000 TO 9750: Transportation and Material Moving Workers                              1.96e-01
## Job059990: Uncodeable                                                                      1.20e+00
## Weight05                                                                                   1.60e-03
## HeightFeet05                                                                               8.96e-02
## HeightInch05                                                                               9.95e-03
## Imagazine1                                                                                 6.01e-02
## Inewspaper1                                                                                7.75e-02
## Ilibrary1                                                                                  6.12e-02
## MotherEd                                                                                   1.25e-02
## FatherEd                                                                                   9.29e-03
## FamilyIncome78                                                                             1.95e-06
## totalHeightMeters5                                                                               NA
## BMI05                                                                                      9.53e-03
##                                                                                          t value
## (Intercept)                                                                                15.90
## unlist(PC1ASVAB)                                                                            5.40
## gendermale                                                                                  1.72
## Education05                                                                                 3.66
## Income87                                                                                    4.38
## Job0510 TO 430: Executive, Administrative and Managerial Occupations                        2.08
## Job051000 TO 1240: Mathematical and Computer Scientists                                     1.84
## Job051300 TO 1560: Engineers, Architects, Surveyers, Engineering and Related Technicians   -0.32
## Job051600 TO 1760: Physical Scientists                                                     -1.50
## Job051800 TO 1860: Social Scientists and Related Workers                                   -0.59
## Job051900 TO 1960: Life, Physical and Social Science Technicians                            0.31
## Job052000 TO 2060: Counselors, Sociala and Religious Workers                                0.60
## Job052100 TO 2150: Lawyers, Judges and Legal Support Workers                                0.60
## Job052200 TO 2340: Teachers                                                                 1.07
## Job052400 TO 2550: Education, Training and Library Workers                                  0.72
## Job052600 TO 2760: Entertainers and Performers, Sports and Related Workers                  2.40
## Job052800 TO 2960: Media and Communications Workers                                         0.66
## Job053000 TO 3260: Health Diagnosing and Treating Practitioners                             1.87
## Job053300 TO 3650: Health Care Technical and Support Occupations                           -0.61
## Job053700 TO 3950: Protective Service Occupations                                           2.71
## Job054000 TO 4160: Food Preparation and Serving Related Occupations                        -0.64
## Job054200 TO 4250: Cleaning and Building Service Occupations                               -0.68
## Job054300 TO 4430: Entertainment Attendants and Related Workers                            -1.70
## Job054500 TO 4650: Personal Care and Service Workers                                        1.05
## Job054700 TO 4960: Sales and Related Workers                                                1.34
## Job05500 TO 950: Management Related Occupations                                             2.44
## Job055000 TO 5930: Office and Administrative Support Workers                                1.57
## Job056000 TO 6130: Farming, Fishing and Forestry Occupations                               -0.18
## Job056200 TO 6940: Construction Trade and Extraction Workers                                0.03
## Job057000 TO 7620: Installation, Maintenance and Repairs Workers                            0.56
## Job057700 TO 7750: Production and Operating Workers                                         0.61
## Job057800 TO 7850: Food Preparation Occupations                                             0.21
## Job057900 TO 8960: Setters, Operators and Tenders                                           0.70
## Job059000 TO 9750: Transportation and Material Moving Workers                              -0.55
## Job059990: Uncodeable                                                                      -0.07
## Weight05                                                                                   -0.08
## HeightFeet05                                                                                0.06
## HeightInch05                                                                               -0.21
## Imagazine1                                                                                 -0.17
## Inewspaper1                                                                                 2.06
## Ilibrary1                                                                                   1.34
## MotherEd                                                                                    0.99
## FatherEd                                                                                   -0.02
## FamilyIncome78                                                                              0.41
## totalHeightMeters5                                                                            NA
## BMI05                                                                                      -0.37
##                                                                                          Pr(>|t|)
## (Intercept)                                                                               < 2e-16
## unlist(PC1ASVAB)                                                                          7.2e-08
## gendermale                                                                                0.08608
## Education05                                                                               0.00026
## Income87                                                                                  1.2e-05
## Job0510 TO 430: Executive, Administrative and Managerial Occupations                      0.03742
## Job051000 TO 1240: Mathematical and Computer Scientists                                   0.06591
## Job051300 TO 1560: Engineers, Architects, Surveyers, Engineering and Related Technicians  0.74549
## Job051600 TO 1760: Physical Scientists                                                    0.13396
## Job051800 TO 1860: Social Scientists and Related Workers                                  0.55221
## Job051900 TO 1960: Life, Physical and Social Science Technicians                          0.75762
## Job052000 TO 2060: Counselors, Sociala and Religious Workers                              0.55171
## Job052100 TO 2150: Lawyers, Judges and Legal Support Workers                              0.54960
## Job052200 TO 2340: Teachers                                                               0.28614
## Job052400 TO 2550: Education, Training and Library Workers                                0.46870
## Job052600 TO 2760: Entertainers and Performers, Sports and Related Workers                0.01668
## Job052800 TO 2960: Media and Communications Workers                                       0.50618
## Job053000 TO 3260: Health Diagnosing and Treating Practitioners                           0.06104
## Job053300 TO 3650: Health Care Technical and Support Occupations                          0.54339
## Job053700 TO 3950: Protective Service Occupations                                         0.00688
## Job054000 TO 4160: Food Preparation and Serving Related Occupations                       0.52483
## Job054200 TO 4250: Cleaning and Building Service Occupations                              0.49666
## Job054300 TO 4430: Entertainment Attendants and Related Workers                           0.08847
## Job054500 TO 4650: Personal Care and Service Workers                                      0.29456
## Job054700 TO 4960: Sales and Related Workers                                              0.18032
## Job05500 TO 950: Management Related Occupations                                           0.01472
## Job055000 TO 5930: Office and Administrative Support Workers                              0.11566
## Job056000 TO 6130: Farming, Fishing and Forestry Occupations                              0.85489
## Job056200 TO 6940: Construction Trade and Extraction Workers                              0.97771
## Job057000 TO 7620: Installation, Maintenance and Repairs Workers                          0.57738
## Job057700 TO 7750: Production and Operating Workers                                       0.54521
## Job057800 TO 7850: Food Preparation Occupations                                           0.83131
## Job057900 TO 8960: Setters, Operators and Tenders                                         0.48162
## Job059000 TO 9750: Transportation and Material Moving Workers                             0.57978
## Job059990: Uncodeable                                                                     0.94664
## Weight05                                                                                  0.93356
## HeightFeet05                                                                              0.94946
## HeightInch05                                                                              0.83232
## Imagazine1                                                                                0.86487
## Inewspaper1                                                                               0.03950
## Ilibrary1                                                                                 0.18075
## MotherEd                                                                                  0.32191
## FatherEd                                                                                  0.98527
## FamilyIncome78                                                                            0.67833
## totalHeightMeters5                                                                             NA
## BMI05                                                                                     0.71388
##                                                                                             
## (Intercept)                                                                              ***
## unlist(PC1ASVAB)                                                                         ***
## gendermale                                                                               .  
## Education05                                                                              ***
## Income87                                                                                 ***
## Job0510 TO 430: Executive, Administrative and Managerial Occupations                     *  
## Job051000 TO 1240: Mathematical and Computer Scientists                                  .  
## Job051300 TO 1560: Engineers, Architects, Surveyers, Engineering and Related Technicians    
## Job051600 TO 1760: Physical Scientists                                                      
## Job051800 TO 1860: Social Scientists and Related Workers                                    
## Job051900 TO 1960: Life, Physical and Social Science Technicians                            
## Job052000 TO 2060: Counselors, Sociala and Religious Workers                                
## Job052100 TO 2150: Lawyers, Judges and Legal Support Workers                                
## Job052200 TO 2340: Teachers                                                                 
## Job052400 TO 2550: Education, Training and Library Workers                                  
## Job052600 TO 2760: Entertainers and Performers, Sports and Related Workers               *  
## Job052800 TO 2960: Media and Communications Workers                                         
## Job053000 TO 3260: Health Diagnosing and Treating Practitioners                          .  
## Job053300 TO 3650: Health Care Technical and Support Occupations                            
## Job053700 TO 3950: Protective Service Occupations                                        ** 
## Job054000 TO 4160: Food Preparation and Serving Related Occupations                         
## Job054200 TO 4250: Cleaning and Building Service Occupations                                
## Job054300 TO 4430: Entertainment Attendants and Related Workers                          .  
## Job054500 TO 4650: Personal Care and Service Workers                                        
## Job054700 TO 4960: Sales and Related Workers                                                
## Job05500 TO 950: Management Related Occupations                                          *  
## Job055000 TO 5930: Office and Administrative Support Workers                                
## Job056000 TO 6130: Farming, Fishing and Forestry Occupations                                
## Job056200 TO 6940: Construction Trade and Extraction Workers                                
## Job057000 TO 7620: Installation, Maintenance and Repairs Workers                            
## Job057700 TO 7750: Production and Operating Workers                                         
## Job057800 TO 7850: Food Preparation Occupations                                             
## Job057900 TO 8960: Setters, Operators and Tenders                                           
## Job059000 TO 9750: Transportation and Material Moving Workers                               
## Job059990: Uncodeable                                                                       
## Weight05                                                                                    
## HeightFeet05                                                                                
## HeightInch05                                                                                
## Imagazine1                                                                                  
## Inewspaper1                                                                              *  
## Ilibrary1                                                                                   
## MotherEd                                                                                    
## FatherEd                                                                                    
## FamilyIncome78                                                                              
## totalHeightMeters5                                                                          
## BMI05                                                                                       
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.19 on 2386 degrees of freedom
## Multiple R-squared:  0.143,  Adjusted R-squared:  0.127 
## F-statistic: 9.07 on 44 and 2386 DF,  p-value: <2e-16
anova(fit.all)
## Analysis of Variance Table
## 
## Response: unlist(PC1)
##                    Df Sum Sq Mean Sq F value  Pr(>F)    
## unlist(PC1ASVAB)    1    353     353  250.50 < 2e-16 ***
## gender              1      6       6    4.11   0.043 *  
## Education05         1     49      49   34.54 4.8e-09 ***
## Income87            1     40      40   28.03 1.3e-07 ***
## Job05              30     98       3    2.31 7.3e-05 ***
## Weight05            1      1       1    0.79   0.373    
## HeightFeet05        1      1       1    0.56   0.454    
## HeightInch05        1      0       0    0.02   0.883    
## Imagazine           1      0       0    0.35   0.552    
## Inewspaper          1     10      10    6.84   0.009 ** 
## Ilibrary            1      3       3    2.21   0.137    
## MotherEd            1      2       2    1.40   0.237    
## FatherEd            1      0       0    0.00   0.973    
## FamilyIncome78      1      0       0    0.17   0.680    
## BMI05               1      0       0    0.13   0.714    
## Residuals        2386   3367       1                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
par(mfrow=c(1,2))
plot(fit.all, 1)
plot(fit.all, 2)
## Warning: not plotting observations with leverage one:
##   704

2 Case study 2: Breast cancer sub-type

The Cancer Genome Atlas (TCGA), a landmark cancer genomics program by National Cancer Institute (NCI), molecularly characterized over 20,000 primary cancer and matched normal samples spanning 33 cancer types. The genome data is open to public from the Genomic Data Commons Data Portal (GDC).

In this study, we focus on 4 sub-types of breast cancer (BRCA): basal-like (basal), Luminal A-like (lumA), Luminal B-like (lumB), HER2-enriched. The sub-type is based on PAM50, a clinical-grade luminal-basal classifier.

  • Luminal A cancers are low-grade, tend to grow slowly and have the best prognosis.
  • Luminal B cancers generally grow slightly faster than luminal A cancers and their prognosis is slightly worse.
  • HER2-enriched cancers tend to grow faster than luminal cancers and can have a worse prognosis, but they are often successfully treated with targeted therapies aimed at the HER2 protein.
  • Basal-like breast cancers or triple negative breast cancers do not have the three receptors that the other sub-types have so have fewer treatment options.

We will try to use mRNA expression data alone without the labels to classify 4 sub-types. Classification without labels or prediction without outcomes is called unsupervised learning. We will use K-means and spectrum clustering to cluster the mRNA data and see whether the sub-type can be separated through mRNA data.

We first read the data using data.table::fread() which is a faster way to read in big data than read.csv().

  1. Summary and transformation

    1. How many patients are there in each sub-type?
table(brca_subtype)
## brca_subtype
## Basal  Her2  LumA  LumB 
##   208    91   628   233
b) Randomly pick 5 genes and plot the histogram by each sub-type.
num_gene <- ncol(brca)
# randomly select 10 gene
set.seed(10)
sample_idx <- sample(num_gene, 5)
# plot count number histogram for each gene
brca %>%
select(all_of(sample_idx)) %>% # select column by index
pivot_longer(cols = everything()) %>% # for facet(0)
ggplot(aes(x = value, y = ..density..)) +
geom_histogram(aes(fill = name)) +
facet_wrap(~name, scales = "free") +
theme_bw() +
theme(legend.position = "none")
## Warning: The dot-dot notation (`..density..`) was deprecated in ggplot2 3.4.0.
## ℹ Please use `after_stat(density)` instead.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

c) Remove gene with zero count and no variability. Then apply logarithmic transform.
  1. Apply kmeans on the transformed dataset with 4 centers and output the discrepancy table between the real sub-type brca_subtype and the cluster labels.
brca_sub_kmeans <- kmeans(x = brca_sub, 4)
#saveRDS(brca_sub_kmeans, "brca_kmeans.RDS")
#brca_sub_kmeans <- readRDS("output/brca_kmeans.RDS")

table(brca_subtype, brca_sub_kmeans$cluster)
##             
## brca_subtype   1   2   3   4
##        Basal   1  17   3 187
##        Her2   39   9  26  17
##        LumA  392  82 154   0
##        LumB   98  22 111   2
  1. Spectrum clustering: to scale or not to scale?

    1. Apply PCA on the centered and scaled dataset. How many PCs should we use and why? You are encouraged to use irlba::irlba().
pca_ret <- prcomp(brca_sub, center = T, scale. = T)
pve <- summary(pca_ret)$importance[2, 1:10]
plot(pve, type="b", pch = 19, frame = FALSE)

According to the elbow rule, we should use 4 principal components as by that point the overwhelming majority of the variance is accounted for.

b) Plot PC1 vs PC2 of the centered and scaled data and PC1 vs PC2 of the centered but unscaled data side by side. Should we scale or not scale for clustering process? Why? (Hint: to put plots side by side, use `gridExtra::grid.arrange()` or `ggpubr::ggrrange()` or `egg::ggrrange()` for ggplots; use `fig.show="hold"` as chunk option for base plots)
#Centered & Scaled
library(gridExtra)
## 
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
## 
##     combine
p <- data.table(x = pca_ret$x[,1],
y = pca_ret$x[,2]) %>%
ggplot() +
geom_point(aes(x = x, y = y)) +
theme_bw() +
xlab("PC1") +
ylab("PC2") + 
ggtitle("Centered & Scaled")
p

pca_ret_centered_unscaled <- prcomp(brca_sub, center = T, scale. = F)
p2 <- data.table(x = pca_ret_centered_unscaled$x[,1],
y = pca_ret$x[,2]) %>%
ggplot() +
geom_point(aes(x = x, y = y)) +
theme_bw() +
xlab("PC1") +
ylab("PC2") + 
ggtitle("Centered but Unscaled")
p2

grid.arrange(p, p2, ncol = 2)

No, we already took the log of the data so I believe it is not necessary to scale the data since the log transform already helps handle skewness and asymmetries in the data.

  1. Spectrum clustering: center but do not scale the data

    1. Use the first 4 PCs of the centered and unscaled data and apply kmeans. Find a reasonable number of clusters using within sum of squared with the elbow rule.
#kmean_ret <- kmeans(x = pca_ret_centered_unscaled$x[, 1:4], 4)
    
wss <- function(df, k) {
  kmeans(df, k, nstart = 10)$tot.withinss
}
    
k.values <- 1:10

wss_values <- sapply(k.values, 
                     function(k) kmeans(pca_ret_centered_unscaled$x[, 1:4], centers = k)$tot.withinss)
    

# extract wss for 1:10 clusters
wss_values <- map_dbl(k.values, function(k) wss(pca_ret_centered_unscaled$x[, 1:4], k))
plot(k.values, wss_values,
type="b", pch = 19, frame = FALSE,
xlab="Number of clusters K",
ylab="Total within-clusters sum of squares")

b) Choose an optimal cluster number and apply kmeans. Compare the real sub-type and the clustering label as follows: Plot scatter plot of PC1 vs PC2. Use point color to indicate the true cancer type and point shape to indicate the clustering label. Plot the kmeans centroids with black dots. Summarize how good is clustering results compared to the real sub-type.
#Doesnt work
kmean_ret <- kmeans(x = pca_ret_centered_unscaled$x[, 1:4], 4)

p <- data.table(x = pca_ret_centered_unscaled$x[,1],
y = pca_ret_centered_unscaled$x[,2],
col = as.factor(brca_subtype),
cl = as.factor(kmean_ret$cluster)) %>%
ggplot() +
geom_point(aes(x = x, y = y, col = col, shape = cl)) +
geom_point(data = as.data.frame(kmean_ret$centers), aes(x = PC1, y = PC2), col = "black", shape = 21, size = 5) +
scale_color_manual(labels = c("LumA", "Her2", "Basal", "LumB"), 
values = scales::hue_pal()(4)) +
scale_shape_manual(labels = c("Cluster 1", "Cluster 2", "Cluster 3", "Cluster 4"),
values = c(4, 16,17,18)) +
theme_bw() +
labs(color = "Breast Cancer Sub-type", shape = "Cluster") +
xlab("PC1") +
ylab("PC2")

p

table(brca_subtype, kmean_ret$cluster)
##             
## brca_subtype   1   2   3   4
##        Basal  17   1   3 187
##        Her2    9  27  27  28
##        LumA   91 376 161   0
##        LumB   22  99 110   2

The clustering seems to be able to generally discern the 4 subtypes into groups however it is not entirely accurate. There are definitely blurred regions and significant overlap where the results are not correct.

c) Compare the clustering result from applying kmeans to the original data and the clustering result from applying kmeans to 4 PCs. Does PCA help in kmeans clustering? What might be the reasons if PCA helps?

Based on analysis of the discrepancy tables for both and the plot it seems that they agree quite well with each other. This means we can use the 3 principal components rather than all the dimensions in the original dataset. This makes the kmeans process more efficient, easier to interpret, and can lend better results.

d) Now we have an x patient with breast cancer but with unknown sub-type. We have this patient's mRNA sequencing data. Project this x patient to the space of PC1 and PC2. (Hint: remember we remove some gene with no counts or no variablity, take log and centered) Plot this patient in the plot in iv) with a black dot. Calculate the Euclidean distance between this patient and each of centroid of the cluster. Can you tell which sub-type this patient might have? 
x_patient <- fread("data/brca_x_patient.csv")
x_patient

x_patient <- log2(as.matrix(x_patient+1e-10))
x_patient_centered <- t(t(x_patient) - rowMeans(x_patient))
pc1score <- pca_ret_centered_unscaled$x[,1] * x_patient_centered

3 Simple Regression through simulations

3.1 Linear model through simulations

This exercise is designed to help you understand the linear model using simulations. In this exercise, we will generate \((x_i, y_i)\) pairs so that all linear model assumptions are met.

Presume that \(\mathbf{x}\) and \(\mathbf{y}\) are linearly related with a normal error \(\boldsymbol{\varepsilon}\) , such that \(\mathbf{y} = 1 + 1.2\mathbf{x} + \boldsymbol{\varepsilon}\). The standard deviation of the error \(\varepsilon_i\) is \(\sigma = 2\).

We can create a sample input vector (\(n = 40\)) for \(\mathbf{x}\) with the following code:

# Generates a vector of size 40 with equally spaced values between 0 and 1, inclusive
x <- seq(0, 1, length = 40)
set.seed(1)
y <- 1+1.2*x+rnorm(40,0,sd=2)

3.1.1 Generate data

Create a corresponding output vector for \(\mathbf{y}\) according to the equation given above. Use set.seed(1). Then, create a scatterplot with \((x_i, y_i)\) pairs. Base R plotting is acceptable, but if you can, please attempt to use ggplot2 to create the plot. Make sure to have clear labels and sensible titles on your plots.

3.1.2 Understand the model

  1. Find the LS estimates of \(\boldsymbol{\beta}_0\) and \(\boldsymbol{\beta}_1\), using the lm() function. What are the true values of \(\boldsymbol{\beta}_0\) and \(\boldsymbol{\beta}_1\)? Do the estimates look to be good?

From the lm() function, we get \(\boldsymbol{\beta}_0 = 1.331\) and \(\boldsymbol{\beta}_1 = 0.906\), vs true values of 1 and 1.2, so this isn’t a perfect estimate, which is a result of the normal error with SD of 2.

x <- seq(0, 1, length = 40)
set.seed(1)
y <- 1+1.2*x+rnorm(40,0,sd=2)
pairs <- data.frame(x = x, y = y)
plot(x,y,
     main="sample output vector for y according to y = 1 + 1.2x + normal error with sd=2")

fit1 <- lm(y ~ x, data=pairs)
fit1
## 
## Call:
## lm(formula = y ~ x, data = pairs)
## 
## Coefficients:
## (Intercept)            x  
##       1.331        0.906
  1. What is your RSE for this linear model fit? Is it close to \(\sigma = 2\)?

The RSE is 1.79 which is fairly close to 2 (10.3% off)

rse <- summary(fit1)$sigma
rse
## [1] 1.79
(rse-2)/2
## [1] -0.103
  1. What is the 95% confidence interval for \(\boldsymbol{\beta}_1\)? Does this confidence interval capture the true \(\boldsymbol{\beta}_1\)?
summary(fit1)
## 
## Call:
## lm(formula = y ~ x, data = pairs)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -4.662 -0.880  0.014  1.247  2.882 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept)    1.331      0.557    2.39    0.022 *
## x              0.906      0.959    0.95    0.350  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.79 on 38 degrees of freedom
## Multiple R-squared:  0.023,  Adjusted R-squared:  -0.00272 
## F-statistic: 0.894 on 1 and 38 DF,  p-value: 0.35

The 95% CI is determined by the estimate +/- 1.96*SE. So here it would be .906 +/- 1.879: [-.973, 2.785]. Yes, this interval captures the true value of 1.2.

  1. Overlay the LS estimates and the true lines of the mean function onto a copy of the scatterplot you made above.
plot(pairs$x, pairs$y,
pch = 16,
xlab = "X",
ylab = "Y",
main = "Overlay of LS estimations and mean function")
abline(fit1, col="red", lwd=4) # many other ways.
abline(h=mean(pairs$y), lwd=5, col="blue") # add a horizontal line, y=mean(y)

3.1.3 diagnoses

  1. Provide residual plot where fitted \(\mathbf{y}\)-values are on the x-axis and residuals are on the y-axis.
plot(fit1)

  1. Provide a normal QQ plot of the residuals.

See part above

  1. Comment on how well the model assumptions are met for the sample you used.

The first assumption is linearity. The residuals seem to be relatively symmetric with respect to h=0 so this assumption is satisfied. The second assumption is homoscedasticity which also seems to be satsified as the residuals are evenly distributed across the band. Finally, the residuals seem to mostly fit the normal distribution according to the qq plot. However, there is some significant deviation at the bottom tail.

3.2 Understand sampling distribution and confidence intervals

This part aims to help you understand the notion of sampling statistics and confidence intervals. Let’s concentrate on estimating the slope only.

Generate 100 samples of size \(n = 40\), and estimate the slope coefficient from each sample. We include some sample code below, which should guide you in setting up the simulation. Note: this code is easier to follow but suboptimal; see the appendix for a more optimal R-like way to run this simulation.

# Inializing variables. Note b_1, upper_ci, lower_ci are vectors
x <- seq(0, 1, length = 40) 
n_sim <- 100              # number of simulations
b1 <- 0                   # n_sim many LS estimates of beta_1 (=1.2). Initialize to 0 for now
upper_ci <- 0             # upper bound for beta_1. Initialize to 0 for now.
lower_ci <- 0             # lower bound for beta_1. Initialize to 0 for now.
t_star <- qt(0.975, 38)   # Food for thought: why 38 instead of 40? What is t_star?

# Perform the simulation
for (i in 1:n_sim){
  y <- 1 + 1.2 * x + rnorm(40, sd = 2)
  lse <- lm(y ~ x)
  lse_output <- summary(lse)$coefficients
  se <- lse_output[2, 2]
  b1[i] <- lse_output[2, 1]
  upper_ci[i] <- b1[i] + t_star * se
  lower_ci[i] <- b1[i] - t_star * se
}
results <- as.data.frame(cbind(se, b1, upper_ci, lower_ci))
results
# remove unecessary variables from our workspace
rm(se, b1, upper_ci, lower_ci, x, n_sim, b1, t_star, lse, lse_out) 
  1. Summarize the LS estimates of \(\boldsymbol{\beta}_1\) (stored in results$b1). Does the sampling distribution agree with theory?
# Inializing variables. Note b_1, upper_ci, lower_ci are vectors
x <- seq(0, 1, length = 40) 
n_sim <- 100              # number of simulations
b1 <- 0                   # n_sim many LS estimates of beta_1 (=1.2). Initialize to 0 for now
upper_ci <- 0             # upper bound for beta_1. Initialize to 0 for now.
lower_ci <- 0             # lower bound for beta_1. Initialize to 0 for now.
t_star <- qt(0.975, 38)   # Food for thought: why 38 instead of 40? What is t_star?

# Perform the simulation
for (i in 1:n_sim){
  y <- 1 + 1.2 * x + rnorm(40, sd = 2)
  lse <- lm(y ~ x)
  lse_output <- summary(lse)$coefficients
  se <- lse_output[2, 2]
  b1[i] <- lse_output[2, 1]
  upper_ci[i] <- b1[i] + t_star * se
  lower_ci[i] <- b1[i] - t_star * se
}
results <- as.data.frame(cbind(se, b1, upper_ci, lower_ci))
## inserting from above^

summary(results$b1)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   -1.36    0.23    1.05    1.06    1.87    3.97
sample_sd <- sd(results$b1)/sqrt(length(results$b1))
ci_lower = mean(results$b1) - 1.96*sample_sd
ci_upper = mean(results$b1) + 1.96*sample_sd
paste("Lower CI interval: ",ci_lower)
## [1] "Lower CI interval:  0.837966630583203"
paste("Higher CI interval: ",ci_upper)
## [1] "Higher CI interval:  1.27250590645873"

We see that the mean of the sampling distribution is 1.07 while the theoretical value should be 1.2. The estimate is not perfect but we do see that true value is captured within a 95% confidence interval of the sampiling mean, therefore we could say that the sampiling distribution agrees with theory.

  1. How many of your 95% confidence intervals capture the true \(\boldsymbol{\beta}_1\)? Display your confidence intervals graphically.
sample_number = c(1:100)
p1 <- ggplot(results, aes(sample_number, b1)) + geom_point() + 
geom_errorbar(aes(ymin = lower_ci, ymax = upper_ci))

p1 + ggtitle("Confidence Interval for Each Sample")

total_including <- sum(results$lower_ci<=1.2 & results$upper_ci>=1.2)
paste("Number of Confidence Intervals including true value: ",total_including)
## [1] "Number of Confidence Intervals including true value:  96"